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Abstract . In many programs solving difference equations, problem size 
Is restricted by the number of available memory cells. A strategy has 
been developed to permit trade-offs between the number of floating 
point operations required and the storage requirements for the solution 
of certain problems, such as block tridiagonal systems of equations. 

This is done by recomputing some Intermediate results Instead of storing 
them. Reducing the storage to the square root of the current require- 
ment will roughly double the number of computations. Reducing the 
storage more than this tends to make the number of computations prohib- 
itively large. In^ theory though. If m Is the order of each sub-matrix 
in the block trldlagonal matrix, one can solve any linear system with 
only 5m^ + 1 temporary storage cells. In many cases m is a constant 
and quite small. For example. In solving a factored form of the three- 
dimensional Navler-Stokes equations, the size m of the block tri- 
diagonals Is 5. In fact, for block tridlagonals arising from finite 
difference solutions of equations of fluid flow, m is rarely more 
than 5. This method lends itself to efficient use on computers with 
parallel processing or vector processing architectures. On these com- 
puters the larger number of floating point operations Is more than 
offset by the decrease In I/O and the Increased percentage of vector 
operations made possible by this algorithm. 
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1. Introduction 


The most widely used algorithm for solving general systems of 
linear equations is Gaussian elimination. Other methods have appeared 
which take advantage of the structure of certain problems, i.e., cyclic 
reduction for banded matrices with constant coefficients. Most methods 
thus far devised have had the objective of reduclng-the total number of 
floating point operations. The method described here does not. 

Our basic objective Is to minimize the overall time and cost of 
solving a given problem. When computers were slow and problems were 
small the way to do this was to minimize arithmetic. With the advent 
of supercomputers, however, other considerations have become important. 

One such consideration Is the ability to vectorize an algorithm. 

For a given computer, the speed of the vector hardware may exceed that 
of the scalar hardware by a factor of ten or more. A common way of 
finding long vectors in a trldlagonal solver is to solve many trldlag- 
onals at once. vector length then becomes the number of simultane- 

ously solved systems. The storage requirenents of such an approach 
exceed those of the scalar approach by a factor of the vector length. 

Another consideration is the time spent in communication with 
secondary memory. The speed of the arithmetic units makes it possible 
to solve, in a reasonable amount of time, problems whose storage 
requirements exceed the capacity of primary memory. On most computers 
it is difficult 'to overlap the transfer time between primary and second- 
ary memory. This becomes the dominant cost in some cases. In addition, 
programs which use secondary memory are often significantly more complex 
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than those which do not. Furthermore, data transfers between memory 
levels are hardware-dependent. Programs which do explicit transfers 
between memory levels are not portable for this reason. Realizing 
this, we turn our efforts toward an algorithm that requires less memory, 
even if it requires more arithmetic. 

Recomputation is such an algorithm. It offers the user a trade-off 
between the number of arithmetic operations, time spent in scalar com- 
putation, and time spent on data transfers to secondary memories. A 
FORTRAN subroutine has been written which utilizes recomputation in the 
solution of block tridiagonal systems. The user can specify, with one 
parameter, exactly how much storage is available in primary memory and 
the subroutine will minimize arithmetic subject to this constraint. 

If there is enough storage the algorithm reduces to the standard, non- 
recomputing case. The minimum allowable space, not counting the solu- 
tion vector, is 5 storage blocks, each an m x n matrix, plus one word. 
In the sense that this is Independent of N, the number of block 
unknowns, we say that there are no storage constraints. 

2. Method 

The algorithm described here performs Gaussian elimination to solve 
a tridiagonal system of equations. This is equivalent to the Thomas 
algorithm [1]. We will deal only with scalar tridlagonal matrices, the 
extension to block tridlagonal matrices being relatively straightforward 
The notation used is defined by the following equations: 


5 





The successive steps which are normally taken to solve this problem 
by Gaussian elimination are: 

forward elimination 
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The storage problems with this method stem from the fact that one 
must compute all of the elements c’ and q’ before any of the elements 
q can be computed. The right-hand side is usually overwritten with the 
solution so that the same storage cell is occupied at various times by 
Tf, qj^, and finally q^. Traditionally, both the c’ vector and the 
right-hand side are stored for a total of 2n-l storage cells (a total 
of (n-l)m^ + nm for a block tridiagonal) . 

Notice that to compute only need and Also, to 

compute we need only ^be matrix elements and 
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Schematically, this is shown in Fig. 1. The dashed boxes contain the 
original matrix entries. In many applications these require no storage, 
since they are either analytically known or can be recovered from other 
Information contained in memory. We consider such applications here. 

The main consequence of this simplification is that if any element of 
the decomposition is known (i.e., b' or c’), then the forward elimina- 
tion can be reinitiated at that point to get any subsequent decomposi- 
tion element. This is the basis for the whole scheme. We save a few, 
selected, elements c' on the forward elimination and then execute the 
following sequence: 

a. Execute the backward substitution in a conventional manner as 
far as possible. 

b. When an element c^ is needed and not available, recompute it 
by reinitiating the forward elimination starting with a stored 
element c*. The best choice is that element whose index is 
highest without exceeding 1. All elements c' with higher 
indexes have been used already. These may be overwritten to 
save other Indexes of c'. If no stored data remains, recom- 
pute from index 1. The element c| is zero. 

When the needed element has been recomputed resume step a. 
Notice that the forward elimination in step b is analogous to the 
original forward elimination. The starting and ending indexes are 
different, as is the amount of available storage, but the form is the 
same. Thus, we may use the same selection process to decide which ele- 
ments to save in steb b that we did on the original forward elimination. 
What follows is a description of and rationale for one such selection 
process. 
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Before any computing is done the available storage is divided into 
two arrays. The first array is for the temporaries. It is dimensioned 
C(IVEC,M,M,KMAX) . The first subscript, t^lcally 64, is the vector 
length which is also the number of simultaneously solved block trldlag- 
onal systems. The next two subscripts are for referencing Inside each 
block. The last index is the block number. This is the only subscript 
used by the selection process since all the trldlagonals are solved in 
parallel. In this discussion it is Indexed by the variable K. Since 
on a given call to the selection routine, some of the array may be 
occupied, the index KMIIT is needed. This is the lowest index of an 
unoccupied block. All storage cells are addressed by C(IVEC,M,M,K) 
such that KMIN^K^KMAX may be overwritten. 

The second array contains pointers and is dimensioned IH(KMAX) , 
Indexed in the same way as the first array. The value of IH(K) is 
the integer I corresponding to the element c^ which occupies 
C( , , ,K) . The inputs to the selection algorithm are: 

IL - The index of the first element c' to be recomputed. 
lU - The index of the last element c' to be. recomputed. 

KMIN - An index to the pointer array IH. Lower indexes may 
not be used. 

KMAX - The size and largest allowable index of the pointer 
array IH. 

IH - The pointer array. On output it contains indices of 

elements to save on the subsequent forward elimination. 

The motivation behind choosing the following selection process is 
to minimize the number of recomputations subject to the storage 
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constraints. To this end we remember the cost of computing a given 
temporary and avoid overwriting any temporary with one that Is , cheaper . 

First, we naively assume that a single computation will be. enough. 
Thus we set IH(KMIN)»IL, IH (KMIN+1) »IL+1 , and so on up to IH(KMAX) . 

If IH (KMAX).^IU then our assumption is correct and the selection 
process terminates. In this case the recomputing algorithm reduces to 
the Thomas algorithm. 


If IH(KMAX)<IU then some recomputing is necessary. The tempo- 
raries corresponding to Indexes in the pointer array are all equally 
expensive to recompute in the following sense: they all may be recov- 

ered with one computation per element by restarting the forward elimina- 
tion (using the fact that cj^ « 0) . Thus we may overwrite them all. 

This may be noted by adding IH(KMAX)-IH(KMIN)+1 to each element in 
the pointer array. If this would yield IH(K^^X}>IU we should add 
lU-IH(KMAX) Instead, so that IH(KMAX)»IU. Thus, the amount to be 
added is MIN(IH(KMAX)-IH(KMIN)+1,IU-IH(KMAX)) . At this point, if 
IH(KMAX)sIU the selection process is completed. 

Otherwise we must overwrite further, picking the indexes, which are 
least expensive to recover. This time they are not all equally inex- 
pensive to recompute. Recomputing ‘^iH(gj,IIN) require three 

computations of c^^. This being the case, we choose not to overwrite 
IH(KMIN) yet. The temporaries corresponding to other elements of the 
pointer array can be computed at a total cost of two computations if 
'^IH(KMIN) used as a starting point. We define here the variable 
KP, initially KMIN but now incremented to KMIN+1. The rationale of 
the previous paragraph is used to justify adding MIN(IH(KMAX)-IH(KP)+1, 
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lU-IHCKMAX)) to all IH(K) such that KP^K^KMAX. As before, if after 
this is done IH(KMAX)=IU, then the selection process is completed. 

If IH(KMAX)/XU, then we can use the above argument to show that com- 


'IH(KP) 


would be relatively expensive, implying three computa^ 


tions for respond by incrementing KP and repeating 

the process until either the process is completed (l.e., IH(KHAX).=>iu) 


or KP exceeds KMAX. 


The second case states that even if one is willing to compute 
everything twice there is not enough storage. The temporaries corres- 
ponding to the elements in the pointer array would all cost two compu- 
tations to recompute. Consequently, to complete the selection process 
we have to be willing to compute some elements three times. In terms 
of the variables in the selection process, this means resetting KP to 
KMIN. Otherwise everything is the same. Simply keep adding 
MIN(IH(KMAX)-IHCKP)+1,IU-IH(KMAX)) and incrementing or resetting KP 
as; appropriate until IH(KMAX)=IU. A simple FORTRAN subroutine to 
accomplish this algorithm in less than 20 executable statements is given 
in the appendix. We will now illustrate the selection algorithm and 
the recomputing scheme by an example. 


Example 

Suppose n = 11 and there is only room to store 3 temporaries. The 
conventional Thomas algorithm requires 10 temporaries. The selection 
algorithm would proceed as follows: 
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1. Initially IL=2,IU=11,KMIN»1,KMAX»3 

2. Naively assume that no recomputation is required. Set 
IH(1)»2 

IH(2)=3 ' 

IH(3)=4 

3. Set KP=KMIN»1. Then MINCIH(KMAX)-IH(KP)+1-,IU-IH(KMAX)) 

=MIN(3,7)»3 

Adding this to all IH(K) from K-KF to K=KMAX gives 

IH(1)»5 

IH.(2)»6 

IH(3)=7 

4. Increment KP so that KP»2. NOW MIN(IH(KMAX)-IH(KP)+1, 

lU- IH (KMAX) ) =MIN ( 2 , 4) « 2 

Adding this to all IH(K) from K»KP to K«KMAX gives 

IH(1)=5 

IH(2)=8 . ... 

IH(3)=9 

.5. Increment KP so that KP»3. Now MIN(IH(KMAX)-IH(KP)+1, 

■ V 

lU-IH(KMAX) ) =MIN(1 , 2) =2 

Adding this to all IH(K) from K=KP to K=KMAX gives 

IH(1)=5 

IH(2)=8 

IH(3)=10 
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6. 


Incranent K7 so that KFa4. Since this exceeds KMAX w'e reset 
it so that KP»1. Now MIN(IH(KMAX)-IH(KP)+1, 
lU-IH(KMAX) ) =MIN ( 6 , 1) 

Adding this to all IH(K) from K»KP to K=KMAX gives 
IH(1)=6 
IH(2)=9 
IH(3)=11 

7. Since [IH(KMAX)“IU] = [IH(3)=11], the selection process is completed. 
It says that we should save the elements c^, c^, and cj^^ on the 
forward elimination. 

The recomputing algorithm would proceed as follows: 

1. Initial forward sweep. Save Cg, c^, and 

2. Backward sweep. Comput and To compute we 

require cj^g. 

3. Since q^g is already computed, c|^ is not needed. Resume 
forward sweep using c^ and overwrite c^j^ with c|g. 

A. Continue the backward sweep, using cj^g and c^ to compute 
qg and qg. 

5. Resume forward sweep with Cg, overwriting c^ and cj^g with 
c^ and Cg. 

6. Continue the backward sweep, computing q^, qg, and q^. 

7. Resume forward sweep from index 1, overwriting Cg, c^, and 
Cg with Cy c^, and Cy 

8. Continue backward sweep by computing q 2 , q^* and q^. 


13 



9. Again resume forward sweep from index 1, overwriting with 
10. Conclude the backward sweep by computing qj^. 

Each forward sweep except the last used all the storage containing ele- 
ments that were no longer needed. Temporaries Cg, c^, and cj^^ were 
computed only once. Temporary c^ was computed three times.. All the 
rest were computed twice. The total cost was almost twice that of the 
conventional Thomas algorithm, yet the required storage was less than 
the square root of that required by the conventional algorithm. In 
larger problems it is often possible to reduce the storage requirements 
by a factor of ten while only doubling the arithmetic, a paying proposi- 
tion if I/O is expensive. It is interesting to note that the minimum 
required storage space is five blocks, each an m x m. matrix plus one 
word. This is extremely expensive, however, the computational effort 
being higher than the nonrecomputing case by a factor of roughly n^/2. 
Three of the blocks are needed for the block elements A, B, and C. 

One is needed for the intermediate B' and the last is needed for the 
temporary C'. One additional word is needed for the pointer array IH 
to keep track of the one temporary. The flowchart in Fig. 2 Illustrates 
how the selection process and the recomputation algorithm fit into the 
Thomas algorithm. 

3. Discussion 

The standard Thomas algorithm has the following operation count. 

Multiplications N(7m^/3 + 3m^ - m/3) - 2m2(m+l) 

Additions N(7m^/3 + 3m^/2 - 5m/6) - 2m2(in+l) (5) 

Divisions Nm . 
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Here N Is the number of block unknowns and m Is the dimension of 
each block. If N >> 1 the second term in the addition and multiplica- 
tion counts may be neglected. Using as a measure of relative cpu time 
the equivalency formula 

1 add » 1 multiply ■ 1/4 divide (6) 

the total number of operations becomes 

N{14m3/3 + 9m2/2 + 17m/6} . (7) 

The variable T is defined as the total number of operations divided 
by the quantity in brackets. In this way the dependence on m of the 
results is largely removed, the variable KMAX, defined above, is a 
measure of the available storage. Finally N, also. defined above, is a 
measure of the problem size. Figure 3 plots T vs N for various values 
of the parameter KMAX. This figure describes the common situation in 
which a computer has a limited total memory. This occurs when secondary 
memory is much slower than primary memory. Often the secondary memory 
is a disk or a standard tape drive. In the case of the current genera- 
tion of micro-computers, it may even be a cassette. In such a case 
recomputing may allow the solution of problems that otherwise could not 
be solved at all. Figure 3 gives, at a glance, the cost of solving 
block tridiagonal systems as a function of problem size given a fixed 
amount of memory. 

Another situation for which recomputing can be helpful is where a 
program has been written, the problem size is fixed, and the user wishes 
to modify the program in some way that requires, more memory than the 
computer has. One way to get more memory is to reduce the amount of 
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space allocated for temporaries used In solving block trldlagonals. 
Depending on the problem this may free a significant amount of storage. 
In this way the user may avoid a complete rewrite which might otherwise 
be necessary to Incorporate transfers to secondary memory. Depending 
on the accounting algorithm for the computer In question, recomputing 
may even be cheaper than transfers to secondary memory. Experience has 
shown, however, that recomputing, rarely pays on a cost/run basis' If any 
element c' Is computed more than twice. 

A situation sometimes arises In which the total computer time Is 
fixed. From this constraint one may estimate the maximum number of 
times each element may be computed. We call this number P. Given P 
and the storage constraint KMAX there is a limit to the number of 
block unknowns we can solve for. We call this number NMAX. The 
question arises: What Is the relationship between P,KMAX, and NMAX? 
Such Information could be useful In deciding on a vector length or. 
deciding If this algorithm would pay at all. It can be shown that the 
recursion relation for finding NMAX(P,KMAX) Is 

NMAX(P, 1) » P + 1 (8a) 

NMAX(1,KMAX) « KMAX + 1 (8b) 

NMAX (P, KMAX) = NMAX (P , KMAX- 1) + NMAX (P-1, KMAX) (8c) 

We notice immediately that NMAX (P, KMAX) »NMAX (KMAX, P), that Is, the 
function NMAX Is symmetrical about the line P-KMAX. Also, along a 
line where P (or KMAX) is a constant, the values of NMAX may be 
exactly fitted by a polynomial of degree P (or KMAX) . The first few 
and the general case are given here. 
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P=1 

NMAX = 

1 + 

KMAX 

(9a) 

P-2 

NMAX - 

1 + 

3/2 KMAX +1/2 KMAx2 

(9b) 

P-3 

NMAX = 

1 + 

11/6 KMAX + KMAx2 + 1/6 KMAX^ 

(9c) 

P-z 

NMAX = 

z+1 

L 

(l (S^^)|*KMAX“‘^/Z!) 

(9d) 




In the general case, are Stirling numbers of the first kind. 

Thus we see that the highest order term is always KMAX^/zI. 

Equation (9d) is given without proof. In principal, however, one could 
substitute Eq. (9d) into (8c) to prove the equality. 

The recomputation algorithm is arithmetically the same as the 
Thomas algorithm; hence, it has exactly the same stability properties 
and gives the same answer. Although the storage overhead is usually 
negligible it does require KMM scalar temporaries. This should be 
compared with KMAX*m^*IVEC temporaries used in the rest of the compu- 
tation or with (N-l)*m2*IVEC temporaries needed for the Thomas algo- 
rithm. The computational overhead is equally negligible, involving 
less than N floating point operations. If no recomputation is done, 
the recomputation algorithm costs virtually the same to use as a con- 
ventional Thomas algorithm. 


4. Conclusions 

It haa proved useful to program the entire block tridiagonal solver 
as a subroutine which has, as an argument, the amount of available 
space. This substantially reduces the consequences of programming at 
the limit of primary memory. This alone helps increase productivity 
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through reducing the number of times programs are rewritten to free 
a tiny amount of storage. 

Using recomputation, problem size can be substantially Increased 
on computers where memory size is poorly matched to processor speed for 
this type of problem. Recomputation can free enough memory to allow 
the effective use of vector processing capabilities on machines like 
the Cray 1 and the CDC 7600. Furthermore, the extra computation 
required is largely made up of dot products, at which these machines 
are very efficient. 

Surprisingly, execution speed can actually be Increased through 
the use of recomputation. This can occur when disk latency becomes a 
substantial portion of the code's running time. It can also occur 
when recomputation is used to increase' vector lengths. In both cases 
costs can be reduced by doing more arithmetic, keeping the job in core 
using recomputation. This algorithm was used on Illlac IV codes at 
Ames Research Center from 1977 until the Illlac was replaced in 1981 
[2], The required vector length of 64 and the small size of primary 
memory made recomputation a virtual necessity on the Illlac, allowing 
the solution of problems that otherwise could not have been solved. 

Any time a situation arises where it costs more to bring a problem in 
and out of core, than it does to perform the arithmetic, or where many 
problems must be solved in parallel, recomputation is likely to pay. 

Of course, this general approach is not limited to tridiagonal 
matrices. It can easily be extended to cover periodic and pentadiag- 
onal matrices. For that matter, it can be used to solve dense or -wide- 
banded matrices. It is not limited to Gaussian elimination but is 
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applicable to any method with a forward and backward sweep that saves 
Intermediate results. 

Larger and faster memories may temporarily reduce the need for 
rficomputatlon but cannot remove Its advantages. As long as there are 
substantial differences In speed between memory hierarchies, computers 
that don't match memory to processor speeds, computers with vector 
speeds significantly higher than scalar speeds, or problems which are 
computationally light relative to their size, there will be a need for 
recomputation schemes. ' 
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Appendix 


SUBROUTINE CHOOSE(IL,IU,KMIN,KMAZ,IH) 

THIS SUBROUTINE IS DESIGNED TO CHOOSE THE OPTIMUM INDICES 
AT WHICH TO SAVE DECOMPOSITION ELEMENTS C^ 

ARGUMENT DESCRIPTION 

IL INPUT-THE LOWEST INDEX FOR WHICH C'(T) MUST BE COMPUTED. 

lU INPUT-THE HIGHEST INDEX FOR WHICH C'(I) IS NEEDED. 

KMIN INPUT-C(KMIN-l) CONTAINS THE DECOMPOSITION ELEMENT 

NEEDED TO RESTART THE FORWARD SUBSTITUTION. INITIALLY 
KMIN-1. 

KMAX INPUT-THE TOTAL NUMBER OF ELEMENTS C'(I) WHICH CAN BE 

STORED. ON THIS CALL TO CHOOSE WE CAN PICK UP TO 
KMAX-KMINfl ELEMENTS. 

IH OUTPUT-AN ARRAY OF LENGTH KMAX WHICH CONTAINS THE 

INDICES OF THE ELEMENTS C'(I) WE WISH TO STORE. 

DIMENSION IH(KHAX) 

DO 10 R - KMIN, KMAX 
IH(K)-K-KMIN+IL 
10 CONTINUE 

THIS BRANCH IS TAKEN IF THERE IS SUFFICIENT SPACE TO 
AVOID RECOMPUTING. 

IF (IH(KMAX) .GT. lU) GOTO 6 


KK-KMIN 

2 IF ( (2*IH(KMAX)+1-IH(KK)) .GE. lU ) GOTO 4 

THIS SECTION IS EXECUTED IF OVERWRITNG ALL THE CHEAP 
, SPACE IS NOT SUFFICIENT. 

KC- IH( KMAX)- IH( KKHl 
DO 3 K - KK.KMAX 
IH(K)-IH(K) + KC 

3 CONTINUE 

PRINT*, (IH(J).J-l, 20) 

KK-KK+1 

WHEN THIS TEST PASSES ANOTHER LEVEL OF RECOMPUTATION IS 
REQUIRED. 

IF (KK .GT. KMAX) KK-KMIN 
GOTO 2 

THIS SECTION IS EXECUTED SO THAT IH(KMAX) IS lU. IT 
ALSO ENSURES THAT IF SOME ELEMENTS MUST BE RECOMPUTED 
MORE TIMES TRAN OTHERS THEY ARE DONX LAST, WHEN THERE 
IS MORE SPACE. 

4 CONTINUE 
IC-IU-IH(KMAX) 

DO 5 K - KK,KMAX 

IH(K) - IH(K) + IC 

5 CONTINUE 

6 RETURN 
END 


20 



Fig. 1 
Fig. 2 
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Figure Captions 


. Data dependencies In a tridiagonal. 


. The Thomas algorithm with recomputation. 


. A summary of tradeoffs offered by recomputation. 
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